Automated words stability and languages phylogeny

Authors

  • Filippo Petroni
  • Maurizio Serva
Abstract

The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville (D'Urville 1832). He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. The method used by modern glottochronology, developed by Morris Swadesh in the 1950s (Swadesh 1952), measures distances from the percentage of shared cognates, which are words with a common historical origin. Recently, we proposed a new automated method which uses the normalized Levenshtein distance among words with the same meaning and averages over the words contained in a list. Another classical problem in glottochronology is the study of the stability of words corresponding to different meanings. Words, in fact, evolve because of lexical changes, borrowings and replacement at a rate which is not the same for all of them. The speed of lexical evolution is different for different meanings and it is probably related to the frequency of use of the associated words (Pagel et al. 2007). This problem is tackled here by an automated methodology based only on the normalized Levenshtein distance.

INTRODUCTION

Glottochronology tries to estimate the time at which languages diverged, with the implicit assumption that vocabularies change at a constant average rate. The concept seems to have its roots in the work of the French explorer Dumont D'Urville. He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific (D'Urville 1832), he introduced the concept of lexical cognates and proposed a method to measure the degree of relation among languages. He used a core vocabulary of 115 basic terms which, impressively, contains all but three of the terms of the Swadesh 100-item list.
Then, he assigned a distance from 0 to 1 to any pair of words with the same meaning, and finally he was able to resolve the relationship for any pair of languages. His conclusion is famous: "La langue est partout la même."

The method used by modern glottochronology was developed by Morris Swadesh in the 1950s (Swadesh 1952). The idea is to consider the percentage of shared cognates in order to compute the distance between pairs of languages. These lexical distances are assumed to be, on average, logarithmically proportional to divergence times. In fact, changes in vocabulary accumulate year after year, and two languages that are initially similar become more and more different. Recent examples of the use of Swadesh lists and cognates to construct language trees are the studies of Gray and Atkinson (Gray and Atkinson 2003) and Gray and Jordan (Gray and Jordan 2000).

We recently proposed an automated method which uses the Levenshtein distance among words in a list (Serva and Petroni 2008, Petroni and Serva 2008). To be precise, we defined the distance between two languages by considering a normalized Levenshtein distance among words with the same meaning, and we averaged over all the words contained in a list. The normalization, which takes word length into account, plays a crucial role, and no sensible results would have been found without it. We applied our method to the Indo-European and the Austronesian groups, considering fifty different languages in both cases (Serva and Petroni 2008, Petroni and Serva 2008).

Almost at the same time, the automated method described above was used and developed by another large group of scholars (Bakker et al. 2008, Holman et al. 2008). In their work, they used lists of 40 words while we used lists of 200. Their choice was made according to a careful study of the stability of different words (Wichmann 2009).

Another classical problem in glottochronology is the study of the stability of words corresponding to different meanings.
Words, in fact, evolve because of lexical changes, borrowings and replacement at a rate which is not the same for all of them. The speed of lexical evolution is different for different meanings and it is probably related to the frequency of use of the associated words (Pagel et al. 2007). The study of word stability is of interest in itself, since it may give strong information on the activities which are at the core of the behavior of a social or ethnic group, but it is also necessary for a proper choice of the input lists for language comparisons. The idea of inferring the stability of an item from its similarity in related languages goes back a long way in the lexicostatistical literature (Thomas 1960, Kroeber 1963, Oswalt 1971).

In this paper we tackle this problem with an automated methodology based on the normalized Levenshtein distance. To reach this goal, it is necessary to obtain a measure of the typical distance of all pairs of words corresponding to a given meaning in a language family. The distance between words is computed as in (Serva and Petroni 2008, Petroni and Serva 2008), avoiding the use of cognates. For any meaning and any language family, we are able to find a number which measures its stability (or rate of evolution) in a completely objective and reproducible manner.

In the next section we define the lexical distance between words. Section 3 is the core of the paper; there we define the automated stability of the meanings and we study the distribution and ranking of stability for the Indo-European family and for the Austronesian one. In section 4 we compare the stability ranking of items. Conclusions and outlook are in section 5.

LEXICAL DISTANCE

Our definition of lexical distance between two words is a variant of the Levenshtein distance, which is simply the minimum number of insertions, deletions, or substitutions of a single character needed to transform one word into the other.
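Our definition is taken as the Levenshtein distance divided by the number of characters of the longer of the two compared words. More precisely, given two words $\alpha_i$ and $\beta_i$ corresponding to the same item $i$ in two languages $\alpha$ and $\beta$, their distance $D(\alpha_i, \beta_i)$ is given by

$$D(\alpha_i, \beta_i) = \frac{D_L(\alpha_i, \beta_i)}{L(\alpha_i, \beta_i)} \qquad (1)$$

where $D_L(\alpha_i, \beta_i)$ is the Levenshtein distance between the two words and $L(\alpha_i, \beta_i)$ is the number of characters of the longer of the two.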
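The definitions above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the handling of empty words, and the example words are assumptions made for illustration. `normalized_distance` follows the definition of the word distance given in the text (Levenshtein distance divided by the length of the longer word), `language_distance` averages it over a list of meanings as described in the introduction, and `mean_pairwise_distance` is one plausible reading of "the typical distance of all pairs of words corresponding to a given meaning"; the stability measure actually used in the paper is defined in its Section 3 and may differ.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty words are at distance 0
    return levenshtein(a, b) / longest


def language_distance(list_a, list_b) -> float:
    """Average normalized distance over words with the same meaning.
    Assumes both lists are aligned: position k holds the same meaning."""
    pairs = list(zip(list_a, list_b))
    if not pairs:
        raise ValueError("word lists must not be empty")
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


def mean_pairwise_distance(words) -> float:
    """Hypothetical helper: mean normalized distance over all pairs of
    words for ONE meaning across the languages of a family."""
    pairs = list(combinations(words, 2))
    if not pairs:
        raise ValueError("need at least two words")
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative orthographic forms only, not data from the paper.
    print(normalized_distance("night", "nacht"))
    print(language_distance(["night", "water"], ["nacht", "wasser"]))
    print(mean_pairwise_distance(["night", "nacht", "notte"]))
```

Dividing by the length of the longer word keeps every word distance between 0 and 1, so long and short words contribute on the same scale when averaged.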

Similar resources

Automated Word Stability and Language Phylogeny

The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D’Urville (1832). He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relationship among languages. The metho...

Lexical evolution rates by automated stability measure

Phylogenetic trees can be reconstructed from the matrix which contains the distances between all pairs of languages in a family. Recently, we proposed a new method which uses normalized Levenshtein distances among words with same meaning and averages on all the items of a given list. Decisions about the number of items in the input lists for language comparison have been debated since the begin...

Phylogeny and geometry of languages from normalized Levenshtein distance

The idea that the distance among pairs of languages can be evaluated from lexical differences seems to have its roots in the work of the French explorer Dumont D’Urville. He collected comparative words lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of...

Automated languages phylogeny from Levenshtein distance

Languages evolve in time according to a process in which reproduction, mutation and extinction are all possible. This is very similar to haploid evolution for asexual organisms or for mtDNA of complex ones. Exploiting this similarity it is possible, in principle, to verify hypotheses concerning their relationship. The key point is the definition of the distance among pairs of languages in analo...

Latent Semantic Analysis of the Languages of Life

We use Latent Semantic Analysis as a basis to study the languages of life. Using this approach we derive techniques to discover latent relationships between organisms such as significant motifs and evolutionary features. Doubly Singular Value Decomposition is defined and the significance of this adaptation is demonstrated by finding a phylogeny of twenty prokaryotes. Minimal Killer Words are us...

Journal title:
  • CoRR

Volume abs/0911.3292

Publication date 2009